214 research outputs found

Clustering sensory inputs using NeuroEvolution of augmenting topologies

Author: Goudbeek Martijn
Halkidi M.
Hsu Yen-Chang
Raue Federico
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/07/2018
Crossref

The IT University of Copenhagen's Repository

An Approach to Web-Scale Named-Entity Disambiguation

Author: C. Whitelaw
I. Bhattacharya
L. Sarmento
M. Halkidi
M. Meilă
P. Pantel
S. Dill
S. Guha
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the potential benefits for data-driven approaches of processing larger data-sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.

Crossref

Repositório Aberto da Universidade do Porto

Spatial correlations in attribute communities

Author: A Barrat
A De Montis
A Decelle
A Lancichinetti
AK Jain
Alessandro Chessa
B Karrer
D Grady
D Hu
D Hu
F Calabrese
Federica Cerina
G Daraganova
L Danon
L Denoeud
M Barthelemy
M Chavez
M Halkidi
MA Porter
Marc Barthelemy
MEJ Newman
P Expert
P Kaluza
R Guimerà
R Guimerá
RandW
RJGB Campello
S Erlander
S Fortunato
S Fortunato
S Gregory
Sergio Gómez
VD Blondel
Vincenzo De Leo
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2012
Community detection is an important tool for exploring and classifying the properties of large complex networks and should be of great help for spatial networks. Indeed, in addition to their location, nodes in spatial networks can have attributes such as the language for individuals, or any other socio-economical feature that we would like to identify in communities. We discuss in this paper a crucial aspect that was not considered in previous studies: the possible existence of correlations between space and attributes. Introducing a simple toy model in which both space and node attributes are considered, we discuss the effect of space-attribute correlations on the results of various community detection methods proposed for spatial networks in this paper and in previous studies. When space is irrelevant, our model is equivalent to the stochastic block model, which has been shown to display a detectability/non-detectability transition. In the regime where space dominates the link formation process, most methods can fail to recover the communities, an effect which is particularly marked when space-attribute correlations are strong. In this latter case, community detection methods which remove the spatial component of the network can miss a large part of the community structure and can lead to incorrect results. Comment: 10 pages and 7 figures.

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Università di Cagliari

IMT Institutional Repository

Factors Affecting Web Page Similarity

Author: A. Tombros
A. Tombros
J.M. Kleinberg
M. Halkidi
N. Jardine
P. Ganesan
S. Ozmutlu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005
Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is a concept that is central to such tools. In this paper, we examine the effect that content and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect similarities between pages in a different manner. We found a number of factors that, when taken into account, can result in effective measures of similarity between web pages. Query information, in particular, proved to be important for the effective organisation of web pages.

CiteSeerX

Crossref

Measuring player’s behaviour change over time in public goods game

Author: A Tsymbal
AP Bradley
C Keser
E Rendón
G Widmer
L Vendramin
L Xiaofeng
M Halkidi
M Spiliopoulou
MJ Zaki
P Kalnis
R Elwell
S Günnemann
T Fawcett
U Fischbacher
U Fischbacher
U Fischbacher
X Robin
Publication date: 09/09/2016

An important issue in the public goods game is whether players' behaviour changes over time and, if so, how significant the change is. In this game, players can be classified into different groups according to the level of their participation in the public good. This problem can be treated as a concept drift problem by asking how much the clusters of players change over a sequence of game rounds. In this study we present a method for measuring changes in clusters with the same items over discrete time points using external clustering validation indices and area under the curve. External clustering indices were originally used to measure the difference between the clusters suggested by clustering algorithms and ground truth labels for items provided by experts. Instead of comparing different cluster labels, we use these indices to compare the clusters of any two consecutive time points, or the clusters of the first time point and each of the remaining time points, to measure the difference between clusters through time. In theory, any external clustering index can be used to measure changes for any traditional (non-temporal) clustering algorithm, because no single time point carries temporal information on its own. For the public goods game, our results indicate that the players are changing over time but the change is smooth and relatively constant between any two time points.
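
The idea in this abstract, comparing the clusterings of consecutive time points with an external validation index and summarising the result as an area under the curve, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the choice of the plain Rand index, the definition of change as one minus the index, the unit spacing between rounds, and the toy `rounds` data are all assumptions made here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings of the same items agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def consecutive_change(rounds):
    """Change between each pair of consecutive time points (assumed: 1 - Rand index)."""
    return [1.0 - rand_index(a, b) for a, b in zip(rounds, rounds[1:])]


def area_under_curve(values):
    """Trapezoidal area under the change curve, assuming unit spacing between points."""
    return sum((x + y) / 2 for x, y in zip(values, values[1:]))


# Hypothetical cluster labels for four players over four game rounds.
rounds = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]
changes = consecutive_change(rounds)
total_change = area_under_curve(changes)
```

Because the Rand index only looks at whether pairs of items share a cluster, relabelling the clusters between rounds does not register as change, which is why an external index fits this comparison.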

Nottingham ePrints

arXiv.org e-Print Archive

Nottingham eTheses

Crossref

An effective non-parametric method for globally clustering genes from expression profiles

Author: AK Jain
F Azuaje
G Sherlock
Gang Li
Jingyu Hou
M Halkidi
MS Aldenderfer
MT Özsu
PC Boutros
PT Spellman
R Simon
R Tibshirani
RB Altman
RJ Hathaway
RR Sokal
S Raychaudhuri
SM Tseng
T Zhang
VS Tseng
Wanlei Zhou
Wei Shi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2007
Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, these are usually confronted with difficulties in meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and global optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.

Deakin Research Online

Crossref

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

Author: A Lempel
A Puglisi
Andrew K Benson
CG Nevill-Manning
David J Russell
DR Bastola
E Ukkonen
EK Costello
EM McCreight
HH Otu
J Ziv
J Ziv
JD Parsons
JD Thompson
Khalid Sayood
L Holm
M Charikar
M Halkidi
P Weiner
RC Edgar
Samuel F Way
SF Altschul
W Li
W Li
W Li
WJ Wilbur
Publication venue: BioMed Central
Publication date: 01/01/2010
Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a CPU execution time comparable with that of CD-HIT-EST, which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
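
The membership rule described in this abstract (compare each new sequence with the cluster representatives, and open a new cluster when none is suitable) can be illustrated with a short sketch. It is not the paper's algorithm: the grammar-based distance is not specified here, so a simple position-wise `mismatch_fraction` stands in for it, and the `threshold` value, the best-match rule, the use of the first member as representative, and the toy `reads` are all assumptions.

```python
def greedy_cluster(sequences, distance, threshold):
    """Assign each sequence to the nearest representative, or start a new cluster."""
    representatives = []  # first member of each cluster (assumed representative)
    clusters = []         # clusters[i] holds the members for representatives[i]
    for seq in sequences:
        best_idx, best_dist = None, None
        for idx, rep in enumerate(representatives):
            d = distance(seq, rep)
            if best_dist is None or d < best_dist:
                best_idx, best_dist = idx, d
        if best_dist is not None and best_dist <= threshold:
            clusters[best_idx].append(seq)
        else:
            # No suitable cluster found: the sequence founds a new one.
            representatives.append(seq)
            clusters.append([seq])
    return clusters


def mismatch_fraction(a, b):
    """Stand-in distance: share of positions that differ (length gaps count as mismatches)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


# Hypothetical toy sequences; real 16S rDNA reads are far longer.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGAACGT"]
clusters = greedy_cluster(reads, mismatch_fraction, threshold=0.25)
```

Passing the distance as a parameter keeps the clustering loop independent of the metric, so a grammar-based or alignment-based distance could be swapped in without changing the loop.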

Crossref

DigitalCommons@University of Nebraska

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Clustering daily patterns of human activities in the city

Author: A Harvey
AH Maslow
C Song
CM Bishop
CR Bhat
D Balcan
FS Chapin
GS Becker
GS Becker
GS Becker
H Ralambondrainy
H Yu
H Zha
IT Jolliffe
J Candia
JC Dunn
JL Bowman
Joseph Ferreira
KW Axhausen
M Batty
M Ben-Akiva
M Brun
M Halkidi
M Turk
M-P Kwan
Marta C. González
MF Goodchild
N Eagle
P Waddell
P Wang
PJ Rousseeuw
PJ Taylor
Q Shen
R Crane
R Durrett
RO Duda
S Freud
S Greaves
S Gupta
S Hanson
S Sang
Shan Jiang
T Hastie
T Hägerstrand
X Wu
Z Huang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/05/2011
Data mining and statistical learning techniques are powerful analysis tools yet to be incorporated in the domain of urban studies and transportation research. In this work, we analyze an activity-based travel survey conducted in the Chicago metropolitan area over a demographically representative sample of its population. Detailed data on activities by time of day were collected from more than 30,000 individuals (and 10,552 households) who participated in a 1-day or 2-day survey implemented from January 2007 to February 2008. We examine this large-scale data in order to explore three critical issues: (1) the inherent daily activity structure of individuals in a metropolitan area, (2) the variation of individual daily activities, that is, how they grow and fade over time, and (3) clusters of individual behaviors and the revelation of their related socio-demographic information. We find that the population can be clustered into 8 and 7 representative groups according to their activities during weekdays and weekends, respectively. Our results enrich the traditional divisions consisting of only three groups (workers, students and non-workers) and provide clusters based on activities at different times of day. The generated clusters combined with socio-demographic information provide a new perspective for urban and transportation planning as well as for emergency response and spreading dynamics, by addressing when, where, and how individuals interact with places in metropolitan areas. Sponsors: Massachusetts Institute of Technology, Dept. of Urban Studies and Planning; United States Dept. of Transportation (Region One University Transportation Center); Singapore-MIT Alliance for Research and Technology.

DSpace@MIT

Crossref

A genetic approach for building different alphabets for peptide and protein classification

Author: A Kontijevskis
A Martin
A Narayanan
Alessandra Lumini
D Sarda
DR Madden
GL Zhang
GZ Liang
H Ogul
HB Shen
I Bozic
J Chen
J Hammer
J Huang
JJ Chou
JJ Chou
KC Chou
KC Chou
L Huang
L Nanni
L Nanni
L Nanni
Loris Nanni
LR Murphy
M Halkidi
M Milik
MC Honeyman
N Cristianini
R Duda
T Fawcett
T Rögnvaldsson
T Rögnvaldsson
T Sturniolo
V Brusic
Y Zhao
YD Cai
Publication venue: BioMed Central
Publication date: 01/01/2008
Background: In this paper, an optimization approach is proposed for producing reduced alphabets for peptide classification, using a Genetic Algorithm. The classification task is performed by a multi-classifier system where each classifier (Linear or Radial Basis Function Support Vector Machines) is trained using features extracted by different reduced alphabets. Each alphabet is constructed by a Genetic Algorithm whose objective function is the maximization of the area under the ROC curve obtained in several classification problems. Results: The new approach has been tested in three peptide classification problems: HIV-protease, recognition of T-cell epitopes and prediction of peptides that bind human leukocyte antigens. The tests demonstrate that the idea of training a pool of classifiers by reduced alphabets, created using a Genetic Algorithm, allows an improvement over other state-of-the-art feature extraction methods. Conclusion: The validity of the novel strategy for creating reduced alphabets is demonstrated by the performance improvement obtained by the proposed approach with respect to other reduced-alphabet-based methods in the tested problems.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Padova

A Normalized Tree Index for identification of correlated clinical parameters in microarray experiments

Author: A Goldhirsch
A Tauchen
A Tauchen
Anika Tauchen
Anke Becker
C Martin
C Sotiriou
Christian W Martin
CM Perou
E Huang
GA Pavlopoulos
H Wang
J Handl
J Quackenbush
J Wang
Kendall
L Sachs
LJ van't Veer
M Halkidi
MB Eisen
MF Ochs
MJ van de Vijver
NA Samaan
NL Johnson
RA Fisher
S Datta
S Loi
S Tsumoto
T Decker
T Sorlie
Tim W Nattkemper
Publication venue: BioMed Central
Publication date: 01/01/2011
Martin C, Tauchen A, Becker A, Nattkemper TW. A Normalized Tree Index for identification of correlated clinical parameters in microarray data. BioData Mining. 2011;4(1): 2. BACKGROUND: Measurements on gene level are widely used to gain new insights into complex diseases, e.g. cancer. A promising approach to understanding basic biological mechanisms is to combine gene expression profiles and classical clinical parameters. However, the computation of a correlation coefficient between high-dimensional data and such parameters is not covered by traditional statistical methods. METHODS: We propose a novel index, the Normalized Tree Index (NTI), to compute a correlation coefficient between the clustering result of high-dimensional microarray data and nominal clinical parameters. The NTI detects correlations between hierarchically clustered microarray data and nominal clinical parameters (labels) and gives a measurement of significance in terms of an empirical p-value of the identified correlations. To this end, the microarray data is first clustered by hierarchical agglomerative clustering using standard settings. In a second step, the computed cluster tree is evaluated. For each label, an NTI is computed measuring the correlation between that label and the clustered microarray data. RESULTS: The NTI successfully identifies correlated clinical parameters at different levels of significance when applied to two real-world microarray breast cancer data sets. Some of the identified highly correlated labels confirm the actual state of knowledge whereas others help to identify new risk factors and provide a good basis to formulate new hypotheses. CONCLUSIONS: The NTI is a valuable tool in the domain of biomedical data analysis. It allows the identification of correlations between high-dimensional data and nominal labels, while at the same time a p-value measures the level of significance of the detected correlations.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publications at Bielefeld University